TD - Chapter details v2

Chapter details

Sentence splitter, tokenizer, lemmatizer, morphological analyzer, tagger

Linguistic processing at the segmentation, lemmatization and tagging levels is carried out by the Pantera tagger, tightly integrated with the Morfeusz SGJP morphological analyzer. Pantera (http://code.google.com/p/pantera-tagger/) is a recently developed rule-based morphosyntactic Brill tagger for Polish. It uses an optimized version of Brill's algorithm, adapted to the specifics of inflectional languages. Tagging is performed in two steps: a smaller set of morphosyntactic categories (part of speech, case, person) is disambiguated in the first run, and the remaining categories in the second. Because of the free word order of Polish, the original set of rule templates proposed by Brill has been extended to cover larger contexts.
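The two-step scheme can be illustrated with a minimal Python sketch. It is not Pantera's implementation: the Rule class, the dictionary representation of tokens, the single-template rule shape (look at the previous token only) and the example rules are all hypothetical, chosen only to show how rules for part of speech, case and person are applied before rules for the remaining categories.

```python
from dataclasses import dataclass

# Categories resolved in the first run, as listed in the text above;
# every other category is left for the second run.
FIRST_RUN = {"pos", "case", "person"}


@dataclass(frozen=True)
class Rule:
    """Hypothetical Brill-style rule: rewrite one category of a token
    when the previous token has a given value for some category."""
    category: str
    old: str
    new: str
    prev_category: str
    prev_value: str


def apply_rules(tokens, rules):
    """Apply each rule in order, left to right, modifying tokens in place."""
    for rule in rules:
        for i in range(1, len(tokens)):
            prev, cur = tokens[i - 1], tokens[i]
            if (cur.get(rule.category) == rule.old
                    and prev.get(rule.prev_category) == rule.prev_value):
                cur[rule.category] = rule.new


def tag_two_step(tokens, rules):
    """Run first-run rules, then the rules for the remaining categories."""
    first = [r for r in rules if r.category in FIRST_RUN]
    second = [r for r in rules if r.category not in FIRST_RUN]
    apply_rules(tokens, first)
    apply_rules(tokens, second)
    return tokens


# Made-up example: the gender rule is listed first but runs second,
# after the case of the noun has been corrected in the first run.
tokens = [
    {"pos": "prep", "case": "gen"},
    {"pos": "subst", "case": "loc", "gender": "m1"},
]
rules = [
    Rule("gender", "m1", "m3", "case", "gen"),
    Rule("case", "loc", "gen", "pos", "prep"),
]
tag_two_step(tokens, rules)
```

Splitting the rule set by category keeps the first run small, and lets second-run rules rely on values that the first run has already settled.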

Morfeusz SGJP (http://sgjp.pl/morfeusz/index.html.en) is a morphological analyzer for Polish, also used for sentence- and token-level segmentation and lemmatization of texts before the morphological part is applied. It uses positional tags that start with the part of speech, followed by the values of the morphosyntactic categories appropriate to that part of speech. The current version of the tool is based on linguistic data from The Grammatical Dictionary of Polish by Zygmunt Saloni. Pantera is available under the GNU GPL v3 license and Morfeusz SGJP under the FreeBSD license.
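As an illustration of the positional tag layout, the sketch below splits a tag into its part of speech and category values. It assumes the colon-separated notation, with dot-separated alternatives within a position, as in "subst:sg:nom.acc:m3"; the function names parse_tag and expand are hypothetical and not part of any Morfeusz API.

```python
from itertools import product


def parse_tag(tag):
    """Split a positional tag into (pos, positions).

    Each position is a list of alternatives, because one position
    may hold several dot-separated values.
    """
    pos, *positions = tag.split(":")
    return pos, [p.split(".") for p in positions]


def expand(tag):
    """List every unambiguous tag covered by a tag with alternatives."""
    pos, positions = parse_tag(tag)
    return [":".join((pos, *combo)) for combo in product(*positions)]


pos, positions = parse_tag("subst:sg:nom.acc:m3")
# pos == "subst"; positions == [["sg"], ["nom", "acc"], ["m3"]]
variants = expand("subst:sg:nom.acc:m3")
# variants == ["subst:sg:nom:m3", "subst:sg:acc:m3"]
```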

NP extractor and MWU lemmatizer

NP extraction is performed by Spejd, an open-source Shallow Parsing and Disambiguation Engine (http://zil.ipipan.waw.pl/Spejd/). Spejd is based on a fully uniform formalism for both partial constituency parsing and morphosyntactic disambiguation: the same grammar rule may contain structure-building operations as well as morphosyntactic correction and disambiguation operations. The formalism and the engine are more flexible than either the usual shallow parsing formalisms, which assume disambiguated input, or the usual unification-based formalisms, which couple disambiguation (via unification) with structure building. Current applications of Spejd include rule-based disambiguation, detection of multiword expressions, valence acquisition, and sentiment analysis. The functionality can be further extended by adding external lexical resources.

The Spejd processing is powered by a grammar of Polish developed by Katarzyna Głowińska for the task of syntactic group recognition in the National Corpus of Polish (http://www.nkjp.pl/). The MWU lemmatizer is a set of extensions to this grammar developed by Łukasz Degórski especially for ATLAS.

NE recognizer

Named entity recognition is performed by Nerf (http://clip.ipipan.waw.pl/Nerf), a statistical CRF-based named entity recognizer trained on the one-million-segment manually annotated subcorpus of the National Corpus of Polish and successfully used for the automatic annotation of the full corpus of one billion segments.

The Nerf annotation model was recreated to maintain consistency with the general ATLAS annotation framework, which is defined to cover dates, money, percentages and time expressions, as well as names of organizations, locations and people. Normalized versions of the entities are provided to facilitate extraction and comparison (e.g. for dates and times, values conforming to the xsd:date and xsd:time types; for money, a value with an ISO currency code).
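The kind of normalized values described above can be sketched in Python as follows. This is not the Nerf or ATLAS code: the function names are hypothetical, and returning money as an (amount, currency code) pair is an assumption, since the serialization of monetary values is not specified here.

```python
from datetime import date, time
from decimal import Decimal


def normalize_date(year, month, day):
    """Return a date in the xsd:date lexical form YYYY-MM-DD."""
    return date(year, month, day).isoformat()


def normalize_time(hour, minute, second=0):
    """Return a time in the xsd:time lexical form hh:mm:ss."""
    return time(hour, minute, second).isoformat()


def normalize_money(amount, currency):
    """Return (amount, code), where code is a three-letter currency code.

    Only the shape of the code is checked, not whether it is actually
    assigned in ISO 4217.
    """
    code = currency.upper()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter currency code: {currency!r}")
    return str(Decimal(amount)), code


normalize_date(2012, 3, 5)        # '2012-03-05'
normalize_time(14, 30)            # '14:30:00'
normalize_money("12.50", "pln")   # ('12.50', 'PLN')
```

Values in these forms compare correctly as plain strings for dates and times, which is what makes them convenient for extraction and comparison.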

Technical

Polish LPC

Sentence splitter, tokenizer, lemmatizer, morphological analyzer, tagger

NP extractor and MWU lemmatizer

NE recognizer